{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial for Chinese Sentiment analysis with hotel review data\n",
    "## Dependencies\n",
    "\n",
    "Python 3.5, numpy, pickle, keras, tensorflow, [jieba](https://github.com/fxsjy/jieba)\n",
    "\n",
    "## Optional for plotting\n",
    "\n",
    "pylab, scipy\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "from os import listdir\n",
    "from os.path import isfile, join\n",
    "import jieba\n",
    "import codecs\n",
    "from langconv import * # convert Traditional Chinese characters to Simplified Chinese characters\n",
    "import pickle\n",
    "import random\n",
    "\n",
    "from keras.models import Sequential\n",
    "from keras.layers.embeddings import Embedding\n",
    "from keras.layers.recurrent import GRU\n",
    "from keras.preprocessing.text import Tokenizer\n",
    "from keras.layers.core import Dense\n",
    "from keras.utils import to_categorical\n",
    "from keras.preprocessing.sequence import pad_sequences\n",
    "from keras.callbacks import TensorBoard"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helper function to pickle and load stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "\n",
    "def __pickleStuff(filename, stuff):\n",
    "    save_stuff = open(filename, \"wb\")\n",
    "    pickle.dump(stuff, save_stuff)\n",
    "    save_stuff.close()\n",
    "def __loadStuff(filename):\n",
    "    saved_stuff = open(filename,\"rb\")\n",
    "    stuff = pickle.load(saved_stuff)\n",
    "    saved_stuff.close()\n",
    "    return stuff"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get lists of files, positive and negative files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dataBaseDirPos = \"./data/ChnSentiCorp_htl_ba_6000/pos/\"\n",
    "dataBaseDirNeg = \"./data/ChnSentiCorp_htl_ba_6000/neg/\"\n",
    "positiveFiles = [dataBaseDirPos + f for f in listdir(dataBaseDirPos) if isfile(join(dataBaseDirPos, f))]\n",
    "negativeFiles = [dataBaseDirNeg + f for f in listdir(dataBaseDirNeg) if isfile(join(dataBaseDirNeg, f))]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show length of samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2916\n",
      "3000\n"
     ]
    }
   ],
   "source": [
    "print(len(positiveFiles))\n",
    "print(len(negativeFiles))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Have a look at what's in a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "距离川沙公路较近,但是公交指示不对,如果是\"蔡陆线\"的话,会非常麻烦.建议用别的路线.房间较为简单.\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "filename = positiveFiles[0]\n",
    "with codecs.open(filename, \"r\") as doc_file:\n",
    "    text=doc_file.read()\n",
    "    print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test stop words\n",
    "Demo what it looks like to tokenize the sentence and remove stop words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache C:\\Users\\hasee\\AppData\\Local\\Temp\\jieba.cache\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==Orginal==:\n",
      "\r",
      "一年多没来过了，再来住，感觉不错，旧瓶装了新酒，标准房里面的设施基本上都换新了，到处都见到笑脸，看来这里才真正是买方市场，我喜欢员工“巴结”客人的感觉，给他个满分。\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading model cost 0.734 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==Tokenized==\tToken count:56\n",
      "\r",
      "一年 多 没来 过 了 ， 再来 住 ， 感觉 不错 ， 旧 瓶装 了 新 酒 ， 标准 房 里面 的 设施 基本上 都 换 新 了 ， 到处 都 见到 笑脸 ， 看来 这里 才 真正 是 买方市场 ， 我 喜欢 员工 “ 巴结 ” 客人 的 感觉 ， 给 他 个 满分 。\n",
      "==Stop Words Removed==\tToken count:24\n",
      "\r",
      "一年 没来 再来 住 感觉 不错 旧 瓶装 新 酒 标准 房 设施 换 新 见到 笑脸 买方市场 喜欢 员工 巴结 客人 感觉 满分\n"
     ]
    }
   ],
   "source": [
    "filename = positiveFiles[110]\n",
    "with codecs.open(filename, \"r\") as doc_file:\n",
    "    text=doc_file.read()\n",
    "    text = text.replace(\"\\n\", \"\")\n",
    "    text = text.replace(\"\\r\", \"\")\n",
    "print(\"==Orginal==:\\n\\r{}\".format(text))\n",
    "    \n",
    "stopwords = [ line.rstrip() for line in codecs.open('./data/chinese_stop_words.txt',\"r\", encoding=\"utf-8\") ]\n",
    "seg_list = jieba.cut(text, cut_all=False)\n",
    "final =[]\n",
    "seg_list = list(seg_list)\n",
    "for seg in seg_list:\n",
    "    if seg not in stopwords:\n",
    "        final.append(seg)\n",
    "print(\"==Tokenized==\\tToken count:{}\\n\\r{}\".format(len(seg_list),\" \".join(seg_list)))\n",
    "print(\"==Stop Words Removed==\\tToken count:{}\\n\\r{}\".format(len(final),\" \".join(final)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare \"doucments\", a list of tuples\n",
    "Some files contain abnormal encoding characters which encoding GB2312 will complain about. Solution: read as bytes then decode as GB2312 line by line, skip lines with abnormal encodings. We also convert any traditional Chinese characters to simplified Chinese characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "documents = []\n",
    "for filename in positiveFiles:\n",
    "    text = \"\"\n",
    "    with codecs.open(filename, \"rb\") as doc_file:\n",
    "        for line in doc_file:\n",
    "            try:\n",
    "                line = line.decode(\"GB2312\")\n",
    "            except:\n",
    "                continue\n",
    "            text+=Converter('zh-hans').convert(line)# Convert from traditional to simplified Chinese\n",
    "\n",
    "            text = text.replace(\"\\n\", \"\")\n",
    "            text = text.replace(\"\\r\", \"\")\n",
    "    documents.append((text, \"pos\"))\n",
    "\n",
    "for filename in negativeFiles:\n",
    "    text = \"\"\n",
    "    with codecs.open(filename, \"rb\") as doc_file:\n",
    "        for line in doc_file:\n",
    "            try:\n",
    "                line = line.decode(\"GB2312\")\n",
    "            except:\n",
    "                continue\n",
    "            text+=Converter('zh-hans').convert(line)# Convert from traditional to simplified Chinese\n",
    "\n",
    "            text = text.replace(\"\\n\", \"\")\n",
    "            text = text.replace(\"\\r\", \"\")\n",
    "    documents.append((text, \"neg\"))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optional step to save/load the documents as pickle file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5916\n",
      "('酒店的房间装修得飞常好，空间也很大，可是我非常不能理解的是，300多一晚的酒店居然房间不提供免费上网。另外，枕巾洗得不干净，说实话下次不考虑住了。', 'neg')\n"
     ]
    }
   ],
   "source": [
    "# Uncomment those two lines to save/load the documents for later use since the step above takes a while\n",
    "# __pickleStuff(\"./data/chinese_sentiment_corpus.p\", documents)\n",
    "# documents = __loadStuff(\"./data/chinese_sentiment_corpus.p\")\n",
    "print(len(documents))\n",
    "print(documents[4000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## shuffle the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "random.shuffle(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare the input and output for the model\n",
    "Each input will be a list of tokens, output will be one token, either \"pos\" or \"neg\". The stopwords are not removed here since the data set are relative small and removing the stop words are not saving much traing time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Tokenize only\n",
    "totalX = []\n",
    "totalY = [str(doc[1]) for doc in documents]\n",
    "for doc in documents:\n",
    "    seg_list = jieba.cut(doc[0], cut_all=False)\n",
    "    seg_list = list(seg_list)\n",
    "    totalX.append(seg_list)\n",
    "\n",
    "\n",
    "#Switch to below code to experiment with removing stop words\n",
    "# Tokenize and remove stop words\n",
    "# totalX = []\n",
    "# totalY = [str(doc[1]) for doc in documents]\n",
    "# stopwords = [ line.rstrip() for line in codecs.open('./data/chinese_stop_words.txt',\"r\", encoding=\"utf-8\") ]\n",
    "# for doc in documents:\n",
    "#     seg_list = jieba.cut(doc[0], cut_all=False)\n",
    "#     seg_list = list(seg_list)\n",
    "#     Uncomment below code to experiment with removing stop words\n",
    "#     final =[]\n",
    "#     for seg in seg_list:\n",
    "#         if seg not in stopwords:\n",
    "#             final.append(seg)\n",
    "#     totalX.append(final)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize distribution of token length across input sentences\n",
    "Decide the max input sequence, here we cover up to 60% sentences. The longer input sequence, the more training time will take, but could improve  prediction accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Max length is:  1804\n",
      "60% cover length up to:  68\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD8CAYAAAB3u9PLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHqZJREFUeJzt3X+QXWWd5/H3x+5JGnQJGnrZIcluZ00cKlAzgr0Rdcqy\njGIQh1C7WIQBzcxEqd2FVQed2WRVFhlnJTWugEVkJxIVIWXAjMP0DgzxR9jackdjGnTQANGW9Ew6\n4tjEGATtxI7f/eM8DTc39/Y99/b9fT+vqi7uOec55z7nkD6ffp7n/FBEYGZm9qJWV8DMzNqDA8HM\nzAAHgpmZJQ4EMzMDHAhmZpY4EMzMDHAgmJlZ4kAwMzPAgWBmZkl/qytQjTPOOCOGhoZaXQ0zs47y\n8MMPPx0Rg5XKdVQgDA0NMTo62upqmJl1FEn/mKecu4zMzAxwIJiZWeJAMDMzIGcgSFotaZ+kMUkb\nSiyfL+metHy3pKE0f6GkhyQ9K+m2onXmSdoi6fuSnpD0H+qxQ2ZmVpuKg8qS+oDNwJuBCWCPpJGI\neKyg2HrgcEQsk7QW2ARcDkwBHwbOTT+FPgj8JCJeIelFwMvmvDdmZlazPC2ElcBYRDwZEceA7cCa\nojJrgDvT5x3AKkmKiOci4utkwVDsj4CPAUTEryPi6Zr2wMzM6iJPICwCDhRMT6R5JctExDRwBFhY\nboOSTk8f/0zSI5K+KOnMMmWvljQqaXRycjJHdc3MrBatGlTuBxYDfx8R5wPfAD5eqmBEbImI4YgY\nHhyseF+FmZnVKE8gHASWFEwvTvNKlpHUDywADs2yzUPAL4AvpekvAufnqIuZmTVInjuV9wDLJS0l\nO/GvBX6/qMwIsI7sL/3LgF0REeU2GBEh6X8DbwB2AauAx8qV72g3LGjhdx9p3XebWcepGAgRMS3p\nWmAn0Ad8JiL2SroRGI2IEWArcJekMeCnZKEBgKRx4DRgnqRLgQvTFUr/Na1zCzAJ/GF9d83MzKqR\n61lGEfEA8EDRvOsLPk8Bby+z7lCZ+f8IvD5vRc3MrLF8p7KZmQEOBDMzSxwIZmYGOBDMzCxxIJiZ\nGeBAMDOzxIFgZmaAA8HMzBIHgpmZAQ4EMzNLHAhmZgY4EMzMLHEgmJkZ4EAwM7PEgWBmZoADwczM\nEgeCmZkBOQNB0mpJ+ySNSdpQYvl8Sfek5bslDaX5CyU9JOlZSbeV2faIpO/NZSfMzGzuKgaCpD5g\nM3ARsAK4QtKKomLrgcMRsQy4GdiU5k8BHwY+UGbb/x54traqm5lZPeVpIawExiLiyYg4BmwH1hSV\nWQPcmT7vAFZJUkQ8FxFfJwuGE0h6CXAd8NGaa29mZnWTJxAWAQcKpifSvJJlImIaOAIsrLDdPwP+\nJ/CLXDU1M7OGasmgsqRXAi+PiL/OUfZqSaOSRicnJ5tQOzOz3pQnEA4CSwqmF6d5JctI6gcWAIdm\n2eZrgGFJ48DXgVdI+j+lCkbElogYjojhwcHBHNU1M7Na5AmEPcBySUslzQPWAiNFZUaAdenzZcCu\niIhyG4yI2yPirIgYAn4X+H5EvKHaypuZWf30VyoQEdOSrgV2An3AZyJir6QbgdGIGAG2AndJGgN+\nShYaAKRWwGnAPEmXAhdGxGP13xUzM5uLioEAEBEPAA8Uzbu+4PMU8PYy6w5V2PY4cG6eepiZWeP4\nTmUzMwMcCGZmljgQzMwMcCCYmVniQDAzM8CBYGZmiQPBzMwAB4KZmSUOBDMzA3LeqWwd6oYFLfre\nI635XjObE7cQzMwMcCCYmVniQDAzM8CBYGZmiQPBzMwAB4KZmSUOBDMzA3IGgqTVkvZJGpO0ocTy\n+ZLuSct3SxpK8xdKekjSs5JuKyh/qqT7JT0haa+km+q1Q2ZmVpuKgSCpD9gMXASsAK6QtKKo2Hrg\ncEQsA24GNqX5U8CHgQ+U2PTHI+Js4DzgdZIuqm0XzMysHvK0EFYCYxHxZEQcA7YDa4rKrAHuTJ93\nAKskKSKei4ivkwXD8yLiFxHxUPp8DHgEWDyH/TAzsznKEwiLgAMF0xNpXskyETENHAEW5qmApNOB\n3wO+lqe8mZk1RksHlSX1A18APhkRT5Ypc7WkUUmjk5OTza2gmVkPyfNwu4PAkoLpxWleqTIT6SS/\nADiUY9tbgB9ExC3lCkTEllSO4eHhyLFN6zBDG+6vy3bGb7q4Ltsx61V5AmEPsFzSUrIT/1rg94vK\njADrgG8AlwG7ImLWk7ekj5IFx7uqrbR1tnoFQKXtOiDMqlMxECJiWtK1wE6gD/hMROyVdCMwGhEj\nwFbgLkljwE/JQgMASePAacA8SZcCFwLPAB8EngAekQRwW0TcUc+ds/bSqCCo9H0OBrN8cr0PISIe\nAB4omnd9wecp4O1l1h0qs1nlq6J1smaHwGx1cDCYzc4vyLGGaIcgKFZYJ4eD2ckcCFZXQ1N3QxuG\nQTG3GsxO5kCwuhiauptO7AV0MJi9wIFgc/ZCGMw9EKo9Mdera8rBYOZAsDmaaxjM9QRcvP5cA2Jo\nw/0OBetZDgSryYldRNWFQSNPuDPbnkswOBSsVzkQrGq1tgqaeZIt/K5awsFdSNaL/IIcq0otYTB+\n08UtPbHO5fvb8fJZs0ZxC8EqOvkKonxh0G5/XdfaneQuJOsVDgSbVfUtggDU1ifQWoLBN7VZL3CX\nkZVVXRjE8z+dcsJ0N5LZiRwIVlItYTA+cCXjA1c1tmJ15lAwe4EDwU5SWxh0VhAUqnXQ2aFg3caB\nYCfIFwZxwk8nh0Ehh4L1OgeCPa+aMMi6hzqvi6gSh4L1MgeCAdWGQXeFQLGZLqRqwsGhYN3AgWAO\ng1k4FKyX5AoESasl7ZM0JmlDieXzJd2Tlu+WNJTmL5T0kKRnJd1WtM6rJH03rfNJpfdoWnM5DCpz\nKFivqBgIkvqAzcBFwArgCkkrioqtBw5HxDLgZmBTmj8FfBj4QIlN3w68G1ieflbXsgNWO4dBfg4F\n6wV5WggrgbGIeDIijgHbgTVFZdYAd6bPO4BVkhQRz0XE18mC4XmSfhM4LSK+GREBfB64dC47YtVx\nGFSvU264M6tVnkBYBBwomJ5I80qWiYhp4AiwsMI2Jyps0xrEYVC7vKHgVoJ1orYfVJZ0taRRSaOT\nk5Otrk7HcxjMnUPBulWeQDgILCmYXpzmlSwjqR9YAByqsM3FFbYJQERsiYjhiBgeHBzMUV0rZWjq\nboamtuEwqA+HgnWjPIGwB1guaamkecBaYKSozAiwLn2+DNiVxgZKioingGckXZCuLnon8DdV195y\nObFV4DCoF4eCdZuKgZDGBK4FdgKPA/dGxF5JN0q6JBXbCiyUNAZcBzx/aaqkceATwB9Imii4Quk/\nA3cAY8APgb+rzy7ZyfI8l8hhUItqQsHBYO0u1/sQIuIB4IGiedcXfJ4C3l5m3aEy80eBc/NW1GqT\ntQ4qcRg0i1+2Y+2s7QeVrXaVB5C77wF1rVDtCd4tBWtXmqWrv+0MDw/H6Ohoq6tRnRsWtORr84ZB\n1wXBDUda9tXVnujdUrBmkfRwRAxXKucWQhfq2TBoMZ/grdM5ELpS5UFkh0FjVPOUVHcdWbtxIHSZ\nyoPIM+MG1kgOBetEDoQu4q6i9uJQsE7jQOgSDoP25FCwTuJA6AIOg/bmwWbrFA6EruBB5G7gVoK1\nmgOhw3kQuTO468g6gQOhg7mrqLO468janQOhQzkMOlOeUHArwVrFgdCBHAadzaFg7cqB0JE8iNwL\nHArWbA6EDvLCW89m40HkTuBBZmtHDoQOke+tZ+4q6iQeZLZ240DoAJXHDMBh0L38tjVrllyBIGm1\npH2SxiRtKLF8vqR70vLdkoYKlm1M8/dJekvB/D+WtFfS9yR9QdJAPXao2/glN92tmlaCQ8EarWIg\nSOoDNgMXASuAKwreizxjPXA4IpYBNwOb0rorgLXAOcBq4FOS+iQtAt4DDEfEuUBfKmcnyTOAfKXD\noIM5FKxd5GkhrATGIuLJiDgGbAfWFJVZA9yZPu8AVklSmr89Io5GxH5gLG0Psvc5nyKpHzgV+NHc\ndqX7+C7k3uFQsHaQJxAWAQcKpifSvJJlImIaOAIsLLduRBwEPg78E/AUcCQivlzLDnQr32vQe6p5\nuY5ZI7RkUFnSS8laD0uBs4AXSyp5ZpN0taRRSaOTk5PNrGaLOQysPLcSrBHyBMJBYEnB9OI0r2SZ\n1AW0ADg0y7pvAvZHxGRE/Ar4EvDaUl8eEVsiYjgihgcHB3NUt/PN3lXkMOh2vkfBWiVPIOwBlkta\nKmke2eDvSFGZEWBd+nwZsCsiIs1fm65CWgosB75F1lV0gaRT01jDKuDxue9O58tzianDoPu568ha\noWIgpDGBa4GdZCfteyNir6QbJV2Sim0FFkoaA64DNqR19wL3Ao8BDwLXRMTxiNhNNvj8CPDdVI8t\ndd2zjpXnElPrBX7mkTWbsj/kO8Pw8HCMjo62uhrVuWFB7qKztw7cVZTLDUdaXYO6y3PSd4vCZiPp\n4YgYrlSuvxmVscrcVVQnVQRwfb+3+4LIeo8fXdEG8t+NbL3IXUfWLA6EFvP9BpaHQ8GawYHQcn63\ngdWPQ8HmwoHQIn63gVXL9ydYozkQWsDvNrBa+WoiayQHQkv43QbWWG4lWC0cCE2W57EUDgObjbuO\nrFEcCE2U714Dv9vAKnPXkTWCA6FJfK+B1VveS1HdUrC8HAhN4HsNrFHcfWT15EBoCt9rYK3nULBK\nHAgN5tdgWqP59ZtWLw6EBsp++dxVZI3n129aPTgQGs5hYO3FrQQrx4HQILP/0jkMrDE8yGxz4UBo\ngFwvNHEYWIO468hq5Tem1VnlMHDrwOqszMt5/KY1m5H3jWm5WgiSVkvaJ2lM0oYSy+dLuict3y1p\nqGDZxjR/n6S3FMw/XdIOSU9IelzSa/LtWvtyGFg78TsUrFoVA0FSH7AZuAhYAVwhaUVRsfXA4YhY\nBtwMbErrrgDWAucAq4FPpe0B3Ao8GBFnA78DPD733Wl/DgMza1d5WggrgbGIeDIijgHbgTVFZdYA\nd6bPO4BVkpTmb4+IoxGxHxgDVkpaALwe2AoQEcci4mdz353Wyds6MGsmtxKsGnkCYRFwoGB6Is0r\nWSYipoEjwMJZ1l0KTAKflfRtSXdIenGpL5d0taRRSaOTk5M5qtt87iqyduZQsLxadZVRP3A+cHtE\nnAc8B5w0NgEQEVsiYjgihgcHB5tZxzpxGFhncChYnkA4CCwpmF6c5pUsI6kfWAAcmmXdCWAiInan\n+TvIAqKj5HuSpMPAWs/3J1geeQJhD7Bc0lJJ88gGiUeKyowA69Lny4BdkV3POgKsTVchLQWWA9+K\niB8DByT9VlpnFfDYHPelqfL+4jgMrF34ElOrpGIgpDGBa4GdZFcC3RsReyXdKOmSVGwrsFDSGHAd\nqfsnIvYC95Kd7B8EromI42md/wJsk/Qo8Ergf9Rvtxrryk9/o9VVMGsYtxJ6l29Mq9KVn/4G/++H\nP81Vdvymi+GGBQ2ukfW8MjemlZO7desWRdeo641p9oI8YeAnT1o7q2Y8wa2F3tLf6go0TR3+Un/1\n1K3AGVR8DaZbBdbmxm+6OPfJfmjD/f4Dp0e4hZDTq6du5Z9zhIEHka1T+CRvxRwIOZw9dUfFMHgd\njzoMrGu566g3OBAqOHvqDqY4hUphsG1gUzOrZVYXfv2mFXIgVDB7GGQcBtbJqrkIwqHQ3RwIs/jQ\n0XUVSgRn8nRT6mLWaA4FcyCU8aGj67g7LmS2rqIzeZrdA+9tZrXMGsoDzb3NgVBCnjAY4JcOA+tZ\nbiV0JwdCkbxh8MTAu5pZLbOmcddR73IgFPlCvInZBpH7+LXDwLqeQ6E3ORAK3Df9Wo7PekiCK/TV\nptXHrJUcCr3HgZDcN/1aNk6/m9m6iq7Sl/no/DvLLDfrPh5k7i0OhOQvpi/nl8wvs9RhYDYbtxK6\ngwMh+RFnlFniMLDe5q6j3uFAAA7HS+jjeMlli3jaYWA9z6HQG3o2ED50dB0vn7qLoaltnHf0L5nm\nRczj2AllTuEof9J/T4tqaNZeHArdL9f7ECStBm4F+oA7IuKmouXzgc8DrwIOAZdHxHhathFYDxwH\n3hMROwvW6wNGgYMR8bY5701O5e41+Hc8zjhn8SMWchaH+JP+e7i0/++bVS2z2jTx/RvjAzA0tY1K\nz/eaCQUPSneWioGQTtqbgTcDE8AeSSMR8VhBsfXA4YhYJmktsAm4XNIKYC1wDnAW8FVJryh4r/J7\nyd7TfFrd9qiC+6ZfW+bGM/FNzuWHA+9oVlXMOtTMa3dnDwXwy3U6TZ4uo5XAWEQ8GRHHgO3AmqIy\na4CZjvYdwCpJSvO3R8TRiNgPjKXtIWkxcDFwx9x3I59Kl5bOfg+CmQHpvR/538XuLqTOkecMuAg4\nUDA9keaVLBMR08ARYGGFdW8B/hT49WxfLulqSaOSRicnJ3NUt7T7pl/L+6f/0yyXlmZ3IZtZZdW+\nDMqh0Bla8iexpLcBP4mIhyuVjYgtETEcEcODg4M1fd993z7Ixul3c5y+2b7JdyGbVcFdQd0nTyAc\nBJYUTC9O80qWkdQPLCAbXC637uuASySNk3VBvVHS3TXUv6KhDffzvnu+M2vLwPcamNXGL9fpLnkC\nYQ+wXNJSSfPIBolHisqMADNvk7kM2BURkeavlTRf0lJgOfCtiNgYEYsjYihtb1dE1P2FxHn+AZ7C\nUW7p3+wwMJsDh0J3qBgIaUzgWmAn2RVB90bEXkk3SrokFdsKLJQ0BlwHbEjr7gXuBR4DHgSuKbjC\nqOX6OM7H+j/tS0vN6sCh0PmU/SHfGYaHh2N0dDR3+dn+4Z3CUYeB2VzccKTk7LwnfI9BNI+khyNi\nuFK5Hr3OMhwGZg3ilkLn6sFACCAcBmZtwKHQXro6EE78SyWe/6n2Gmozq0413UEOhfbR1WMIJ2ji\n817MLDM0dTfZkwFme8xFlX+olRm7sPI8hmBmLffCYy5m+8MzC4wsPKyVHAhm1lAOhc7hQDCzhqsu\nFLY5GFrEgWBmTVFNKLi10Bq5XpBjZlYP4wNXFZzoZxtozpYNTd3FePE7Slp5gUiXD2i7hWBmTZWv\npQBZKLyIoaltnD3VtNem9DQHgpk13YmhULkLaYpTHApN4EAws5YYH7iK8YErydtamOIUjys0mAPB\nzFqqui4kDzY3kgPBzFouC4Vf41BoLQeCmbWF7GqimVDw/Qqt4EAws7YxPvAOxgeuZIBf4vsVms+B\nYGZt54mBd1HNuMLSKb8Ctx5y3ZgmaTVwK9AH3BERNxUtnw98HngVcAi4PCLG07KNwHrgOPCeiNgp\naUkqfybZ//EtEXFrXfbIzLpCNTexBf0MTW1L0w18xH2rbopr0g1xFVsIkvqAzcBFwArgCkkrioqt\nBw5HxDLgZmBTWncFsBY4B1gNfCptbxp4f0SsAC4ArimxTTPrcdVegeRupLnJ02W0EhiLiCcj4hiw\nHVhTVGYNMNNm2wGskqQ0f3tEHI2I/cAYsDIinoqIRwAi4ufA48Ciue+OmXWb/DexzXAo1CpPICwC\nDhRMT3Dyyfv5MhExDRwBFuZZV9IQcB6wO3+1zayXVHcTG/hKpNq0dFBZ0kuAvwLeFxHPlClztaRR\nSaOTk5PNraCZtZXxgasQ01QTCjPB4IHnyvIEwkFgScH04jSvZBlJ/cACssHlsutK+g2yMNgWEV8q\n9+URsSUihiNieHBwMEd1zayb7R9YVxAK+ccXgn6HQgV5AmEPsFzSUknzyAaJR4rKjADr0ufLgF2R\nvax5BFgrab6kpcBy4FtpfGEr8HhEfKIeO2JmvWP/wDrGB66suhspuxrJXUjlVAyENCZwLbCTbPD3\n3ojYK+lGSZekYluBhZLGgOuADWndvcC9wGPAg8A1EXEceB3wDuCNkr6Tft5a530zsx6Q/0ok8NjC\n7JT9Id8ZhoeHY3R0tLaVW/lSDTNruOwEP3O/wmz3LcyI5//bsPsW6mWO9yFIejgihiuV8xvTzKwr\nzJzUT/zLv/Jb2bJ1mnBTWwfwoyvMrKvMXKJay9VIvd6d5EAws660f2Ad+ccWZvR2MDgQzKxrVX+X\n84zeDAYHgpl1tZPvcq49GJZNfbYRVWwbHlQ2s55QetAZ8l2RlJWZZl4agO7OwWcHgpn1lMITef4r\nkk4u043B4EAws55V/aWqJ5d54ZLVGZ0bEg4EM+t59QiGQp3aenAgmJkltQdDoXKth/YPCAeCmVmR\nmRP3sqnPMs28NLf2YJhRGBCn8XMeHfiPNdawMRwIZmZljA38ITDXFkPp9Z7hX5w0/tDqkHAgmJlV\nMLdLVss5ed2TQ6K53UwOBDOznMpfsjpjLgFRev2hqW2w4X4ATpvfx6MfWT3H7yjPgWBmVoPiv9zr\n061UrKib6ehxfvu/P9iwUPCjK8zM6mDmERmn8XNOfExGfd8588zR43XdXiG3EMzM6qh4ULj8g/Hq\n1Yqon1wtBEmrJe2TNCZpQ4nl8yXdk5bvljRUsGxjmr9P0lvybtPMrBvMtBwKf+A4jWxF1KpiC0FS\nH7AZeDMwAeyRNBIRjxUUWw8cjohlktYCm4DLJa0A1gLnAGcBX5X0irROpW2amXWl8YF3njB94v0O\nM0q3IE6b39egWuXrMloJjEXEkwCStgNrgMKT9xrghvR5B3CbJKX52yPiKLBf0ljaHjm2aWbWE2bu\nd5jx5qmP8QP+dcGcLBza4SqjRcCBgukJ4NXlykTEtKQjwMI0/5tF6y5Knytt08ysJ31lYOOJM244\n0pTvbftBZUlXA1enyWcl7atxU2cAT9enVg3nujZGp9S1U+oJrmujnFjXj8x5APrf5CmUJxAOAksK\npheneaXKTEjqBxYAhyqsW2mbAETEFmBLjnrOStJoRAzPdTvN4Lo2RqfUtVPqCa5ro7SqrnmuMtoD\nLJe0VNI8skHikaIyI8C69PkyYFdERJq/Nl2FtBRYDnwr5zbNzKyJKrYQ0pjAtcBOoA/4TETslXQj\nMBoRI8BW4K40aPxTshM8qdy9ZIPF08A1EXEcoNQ26797ZmaWV64xhIh4AHigaN71BZ+ngLeXWffP\ngT/Ps80Gm3O3UxO5ro3RKXXtlHqC69ooLamrsp4dMzPrdX6WkZmZAT0QCO38iAxJSyQ9JOkxSXsl\nvTfNf5mkr0j6QfrvS1td1xmS+iR9W9Lfpuml6XElY+nxJcW3W7aEpNMl7ZD0hKTHJb2mXY+rpD9O\n//+/J+kLkgba5bhK+oykn0j6XsG8ksdRmU+mOj8q6fw2qOtfpH8Dj0r6a0mnFywr+VidVtW1YNn7\nJYWkM9J0045rVwdCwWM3LgJWAFekx2m0i2ng/RGxArgAuCbVbwPwtYhYDnwtTbeL9wKPF0xvAm6O\niGXAYbLHmLSDW4EHI+Js4HfI6tx2x1XSIuA9wHBEnEt2kcXM41/a4bh+Dii+NbbccbyI7ErC5WT3\nDt3epDrO+Bwn1/UrwLkR8dvA94GNAEWP1VkNfCqdL5rlc5xcVyQtAS4E/qlgdtOOa1cHAgWP3YiI\nY8DMIzLaQkQ8FRGPpM8/JztpLSKr452p2J3Apa2p4YkkLQYuBu5I0wLeSPa4EmiTukpaALye7Oo3\nIuJYRPyMNj2uZBd3nJLu4TkVeIo2Oa4R8X/JrhwsVO44rgE+H5lvAqdL+s3m1LR0XSPiyxExnSa/\nSXbP00xdt0fE0YjYDxQ+VqcldU1uBv6UE59217Tj2u2BUOqxG4vKlG0pZU+IPQ/YDZwZEU+lRT8G\nzmxRtYrdQvaP9ddpeiHws4JfuHY5vkuBSeCzqXvrDkkvpg2Pa0QcBD5O9hfhU8AR4GHa87jOKHcc\n2/337Y+Av0uf266uktYAByPiH4oWNa2u3R4IHUHSS4C/At4XEc8ULks3+LX8UjBJbwN+EhEPt7ou\nOfQD5wO3R8R5wHMUdQ+10XF9KdlfgEvJngj8Ykp0JbSrdjmOlUj6IFkX7bZKZVtB0qnAfwOur1S2\nkbo9EPI8dqOlJP0GWRhsi4gvpdn/PNMkTP/9SavqV+B1wCWSxsm63t5I1k9/eurqgPY5vhPARETs\nTtM7yAKiHY/rm4D9ETEZEb8CvkR2rNvxuM4odxzb8vdN0h8AbwOujBeus2+3ur6c7I+Cf0i/Y4uB\nRyT9K5pY124PhLZ+REbqg98KPB4RnyhYVPgokHXA3zS7bsUiYmNELI6IIbLjuCsirgQeIntcCbRP\nXX8MHJD0W2nWKrK75dvuuJJ1FV0g6dT072Gmrm13XAuUO44jwDvTVTEXAEcKupZaQtJqsm7OSyLi\nFwWLyj1WpyUi4rsR8S8jYij9jk0A56d/y807rhHR1T/AW8muLvgh8MFW16eobr9L1tx+FPhO+nkr\nWd/814AfAF8FXtbquhbV+w3A36bP/5bsF2kM+CIwv9X1S/V6JTCaju19wEvb9bgCHwGeAL4H3AXM\nb5fjCnyBbGzjV2QnqfXljiPZQ/s3p9+175JdOdXquo6R9b/P/H79r4LyH0x13Qdc1Oq6Fi0fB85o\n9nH1ncpmZgZ0f5eRmZnl5EAwMzPAgWBmZokDwczMAAeCmZklDgQzMwMcCGZmljgQzMwMgP8Pq3ia\nTtcs64QAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x268d07e19b0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import scipy.stats as stats\n",
    "import pylab as pl\n",
    "h = sorted([len(sentence) for sentence in totalX])\n",
    "maxLength = h[int(len(h) * 0.60)]\n",
    "print(\"Max length is: \",h[len(h)-1])\n",
    "print(\"60% cover length up to: \",maxLength)\n",
    "h = h[:5000]\n",
    "fit = stats.norm.pdf(h, np.mean(h), np.std(h))  #this is a fitting indeed\n",
    "\n",
    "pl.plot(h,fit,'-o')\n",
    "pl.hist(h,normed=True)      #use this to draw histogram of your data\n",
    "pl.show() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Words to tokens, pad each input sequence to max input length if it is shorter\n",
    "Save the input tokenizer, since we need to use the same tokenizer for our new predition data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input vocab_size: 22124\n"
     ]
    }
   ],
   "source": [
    "totalX = [\" \".join(wordslist) for wordslist in totalX]  # Keras Tokenizer expect the words tokens to be seperated by space \n",
    "input_tokenizer = Tokenizer(30000) # Initial vocab size\n",
    "input_tokenizer.fit_on_texts(totalX)\n",
    "vocab_size = len(input_tokenizer.word_index) + 1\n",
    "print(\"input vocab_size:\",vocab_size)\n",
    "totalX = np.array(pad_sequences(input_tokenizer.texts_to_sequences(totalX), maxlen=maxLength))\n",
    "__pickleStuff(\"./data/input_tokenizer_chinese.p\", input_tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output be an array of 0s and 1s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "output vocab_size: 3\n"
     ]
    }
   ],
   "source": [
    "target_tokenizer = Tokenizer(3)\n",
    "target_tokenizer.fit_on_texts(totalY)\n",
    "print(\"output vocab_size:\",len(target_tokenizer.word_index) + 1)\n",
    "totalY = np.array(target_tokenizer.texts_to_sequences(totalY)) -1\n",
    "totalY = totalY.reshape(totalY.shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "totalY[40:50]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trun output array 0s and 1s to categories as one-hot vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "totalY = to_categorical(totalY, num_classes=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.,  1.],\n",
       "       [ 0.,  1.],\n",
       "       [ 1.,  0.],\n",
       "       [ 1.,  0.],\n",
       "       [ 1.,  0.],\n",
       "       [ 1.,  0.],\n",
       "       [ 0.,  1.],\n",
       "       [ 0.,  1.],\n",
       "       [ 1.,  0.],\n",
       "       [ 1.,  0.]])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "totalY[40:50]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "output_dimen = totalY.shape[1] # which is 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save meta data for later predition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "target_reverse_word_index = {v: k for k, v in list(target_tokenizer.word_index.items())}\n",
    "sentiment_tag = [target_reverse_word_index[1],target_reverse_word_index[2]] # either [\"neg\",\"pos\"] or [\"pos\",\"neg\"]\n",
    "metaData = {\"maxLength\":maxLength,\"vocab_size\":vocab_size,\"output_dimen\":output_dimen,\"sentiment_tag\":sentiment_tag}\n",
    "__pickleStuff(\"./data/meta_sentiment_chinese.p\", metaData)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build the Model, train and save it\n",
    "The training data is logged to Tensorboard, we can look at it by cd into directory \n",
    "\n",
    "\"./Graph/sentiment_chinese\" and run\n",
    "\n",
    "\n",
    "\"python -m tensorflow.tensorboard --logdir=.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "embedding_1 (Embedding)      (None, 68, 256)           5663744   \n",
      "_________________________________________________________________\n",
      "gru_1 (GRU)                  (None, 68, 256)           393984    \n",
      "_________________________________________________________________\n",
      "gru_2 (GRU)                  (None, 256)               393984    \n",
      "_________________________________________________________________\n",
      "dense_1 (Dense)              (None, 2)                 514       \n",
      "=================================================================\n",
      "Total params: 6,452,226\n",
      "Trainable params: 6,452,226\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n",
      "Train on 5324 samples, validate on 592 samples\n",
      "Epoch 1/20\n",
      "5324/5324 [==============================] - 23s - loss: 0.6747 - acc: 0.5697 - val_loss: 0.5101 - val_acc: 0.7584\n",
      "Epoch 2/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.5146 - acc: 0.7611 - val_loss: 0.3959 - val_acc: 0.8294\n",
      "Epoch 3/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.3989 - acc: 0.8330 - val_loss: 0.3519 - val_acc: 0.8429\n",
      "Epoch 4/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.3248 - acc: 0.8770 - val_loss: 0.3605 - val_acc: 0.8682\n",
      "Epoch 5/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.2834 - acc: 0.8944 - val_loss: 0.3100 - val_acc: 0.8716\n",
      "Epoch 6/20\n",
      "5324/5324 [==============================] - 23s - loss: 0.2544 - acc: 0.9038 - val_loss: 0.3331 - val_acc: 0.8733\n",
      "Epoch 7/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.2251 - acc: 0.9149 - val_loss: 0.3565 - val_acc: 0.8818\n",
      "Epoch 8/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.1871 - acc: 0.9281 - val_loss: 0.3329 - val_acc: 0.8801\n",
      "Epoch 9/20\n",
      "5324/5324 [==============================] - 23s - loss: 0.1685 - acc: 0.9393 - val_loss: 0.2901 - val_acc: 0.8868\n",
      "Epoch 10/20\n",
      "5324/5324 [==============================] - 23s - loss: 0.1374 - acc: 0.9500 - val_loss: 0.3199 - val_acc: 0.8986\n",
      "Epoch 11/20\n",
      "5324/5324 [==============================] - 23s - loss: 0.1296 - acc: 0.9529 - val_loss: 0.3580 - val_acc: 0.8970\n",
      "Epoch 12/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.1170 - acc: 0.9570 - val_loss: 0.3744 - val_acc: 0.9020\n",
      "Epoch 13/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.1268 - acc: 0.9540 - val_loss: 0.3470 - val_acc: 0.8885\n",
      "Epoch 14/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0978 - acc: 0.9628 - val_loss: 0.4038 - val_acc: 0.8986\n",
      "Epoch 15/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0914 - acc: 0.9649 - val_loss: 0.4046 - val_acc: 0.8834\n",
      "Epoch 16/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0846 - acc: 0.9711 - val_loss: 0.3720 - val_acc: 0.8970\n",
      "Epoch 17/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0898 - acc: 0.9641 - val_loss: 0.3941 - val_acc: 0.8851\n",
      "Epoch 18/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0816 - acc: 0.9677 - val_loss: 0.3261 - val_acc: 0.8851\n",
      "Epoch 19/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0765 - acc: 0.9724 - val_loss: 0.4196 - val_acc: 0.8851\n",
      "Epoch 20/20\n",
      "5324/5324 [==============================] - 22s - loss: 0.0776 - acc: 0.9699 - val_loss: 0.4752 - val_acc: 0.8851\n",
      "Saved model!\n"
     ]
    }
   ],
   "source": [
    "embedding_dim = 256\n",
    "\n",
    "model = Sequential()\n",
    "model.add(Embedding(vocab_size, embedding_dim,input_length = maxLength))\n",
    "# Each input would have a size of (maxLength x 256) and each of these 256 sized vectors are fed into the GRU layer one at a time.\n",
    "# All the intermediate outputs are collected and then passed on to the second GRU layer.\n",
    "model.add(GRU(256, dropout=0.9, return_sequences=True))\n",
    "# Using the intermediate outputs, we pass them to another GRU layer and collect the final output only this time\n",
    "model.add(GRU(256, dropout=0.9))\n",
    "# The output is then sent to a fully connected layer that would give us our final output_dim classes\n",
    "model.add(Dense(output_dimen, activation='softmax'))\n",
    "# We use the adam optimizer instead of standard SGD since it converges much faster\n",
    "tbCallBack = TensorBoard(log_dir='./Graph/sentiment_chinese', histogram_freq=0,\n",
    "                            write_graph=True, write_images=True)\n",
    "model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
    "model.summary()\n",
    "model.fit(totalX, totalY, validation_split=0.1, batch_size=32, epochs=20, verbose=1, callbacks=[tbCallBack])\n",
    "model.save('./data/sentiment_chinese_model.HDF5')\n",
    "\n",
    "print(\"Saved model!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Below are prediction code\n",
    "Function to load the meta data and the model we just trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = None\n",
    "sentiment_tag = None\n",
    "maxLength = None\n",
    "def loadModel():\n",
    "    global model, sentiment_tag, maxLength\n",
    "    metaData = __loadStuff(\"./data/meta_sentiment_chinese.p\")\n",
    "    maxLength = metaData.get(\"maxLength\")\n",
    "    vocab_size = metaData.get(\"vocab_size\")\n",
    "    output_dimen = metaData.get(\"output_dimen\")\n",
    "    sentiment_tag = metaData.get(\"sentiment_tag\")\n",
    "    embedding_dim = 256\n",
    "    if model is None:\n",
    "        model = Sequential()\n",
    "        model.add(Embedding(vocab_size, embedding_dim, input_length=maxLength))\n",
    "        # Each input would have a size of (maxLength x 256) and each of these 256 sized vectors are fed into the GRU layer one at a time.\n",
    "        # All the intermediate outputs are collected and then passed on to the second GRU layer.\n",
    "        model.add(GRU(256, dropout=0.9, return_sequences=True))\n",
    "        # Using the intermediate outputs, we pass them to another GRU layer and collect the final output only this time\n",
    "        model.add(GRU(256, dropout=0.9))\n",
    "        # The output is then sent to a fully connected layer that would give us our final output_dim classes\n",
    "        model.add(Dense(output_dimen, activation='softmax'))\n",
    "        # We use the adam optimizer instead of standard SGD since it converges much faster\n",
    "        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
    "        model.load_weights('./data/sentiment_chinese_model.HDF5')\n",
    "        model.summary()\n",
    "    print(\"Model weights loaded!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Functions to convert sentence to model input, and predict result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def findFeatures(text):\n",
    "    text=Converter('zh-hans').convert(text)\n",
    "    text = text.replace(\"\\n\", \"\")\n",
    "    text = text.replace(\"\\r\", \"\") \n",
    "    seg_list = jieba.cut(text, cut_all=False)\n",
    "    seg_list = list(seg_list)\n",
    "    text = \" \".join(seg_list)\n",
    "    textArray = [text]\n",
    "    input_tokenizer_load = __loadStuff(\"./data/input_tokenizer_chinese.p\")\n",
    "    textArray = np.array(pad_sequences(input_tokenizer_load.texts_to_sequences(textArray), maxlen=maxLength))\n",
    "    return textArray\n",
    "def predictResult(text):\n",
    "    if model is None:\n",
    "        print(\"Please run \\\"loadModel\\\" first.\")\n",
    "        return None\n",
    "    features = findFeatures(text)\n",
    "    predicted = model.predict(features)[0] # we have only one sentence to predict, so take index 0\n",
    "    predicted = np.array(predicted)\n",
    "    probab = predicted.max()\n",
    "    predition = sentiment_tag[predicted.argmax()]\n",
    "    return predition, probab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calling the load model function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "embedding_3 (Embedding)      (None, 68, 256)           5663744   \n",
      "_________________________________________________________________\n",
      "gru_5 (GRU)                  (None, 68, 256)           393984    \n",
      "_________________________________________________________________\n",
      "gru_6 (GRU)                  (None, 256)               393984    \n",
      "_________________________________________________________________\n",
      "dense_3 (Dense)              (None, 2)                 514       \n",
      "=================================================================\n",
      "Total params: 6,452,226\n",
      "Trainable params: 6,452,226\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n",
      "Model weights loaded!\n"
     ]
    }
   ],
   "source": [
    "loadModel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Try some new comments, feel free to try your own\n",
    "The result tuple consists the predicted result and likehood."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('pos', 0.99807882)"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictResult(\"还好，床很大而且很干净，前台很友好，很满意，下次还来。\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('neg', 0.77338696)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictResult(\"床上有污渍，房间太挤不透气，空调不怎么好用。\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('pos', 0.98943686)"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictResult(\"房间有点小但是设备还齐全，没有异味。\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('neg', 0.73937547)"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictResult(\"房间还算干净，一般般吧，短住还凑合。\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('neg', 0.65142196)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictResult(\"开始不太满意，前台好说话换了一间，房间很干净没有异味。\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
